Annie Lu
CMSC320
Final Tutorial Project
Published December 15th, 2018
Update: Maps only work well with Safari.
Despite being a developed country, the United States has a considerable amount of income inequality. As an American issue, I want to explore (1) where income inequality resides, using maps, and (2) the various factors that could affect income. Since the dataset is large, I will look at this at a regional level, the East Coast, rather than a national level: specifically DC, Maryland, Virginia, Pennsylvania, Delaware, and New Jersey. Another objective of this tutorial is to get familiar with the American Community Survey (ACS), because it seems to be a reliable, widely applicable source on the U.S. population.
Quite a few libraries are used in this tutorial. Two fundamentals for data science are pandas and numpy. Pandas provides flexible data structures, Series and DataFrames, along with useful functions to manipulate them. Numpy is a powerful mathematical library built around an n-dimensional array object, which performs fast calculations since it is implemented in C.
import pandas as pd
import numpy as np
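As a tiny illustration of the two libraries (the names and values here are made up for the example, not from the ACS data):

```python
import numpy as np
import pandas as pd

# A pandas Series pairs values with a labeled index
incomes = pd.Series([52000, 61000, 47000], index=['A', 'B', 'C'])

# NumPy operations are vectorized: no explicit Python loop needed
log_incomes = np.log(incomes.values)

print(incomes.max())      # 61000
print(log_incomes.shape)  # (3,)
```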
Libraries for Data Collection:
from bs4 import BeautifulSoup
import requests
Libraries for Data Analysis and Data Visualization:
import matplotlib.pyplot as plt
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
import seaborn as sns
import folium
import branca
I'll also be using a tool called IPUMS which helps create a dataset of selected variables from one or multiple samples, such as the American Community Survey (ACS).
In this phase of the data life cycle, we collect data from both websites and files. A common file format is csv, which can be easily read with Pandas' read_csv function. For websites, we fetch the html with requests and parse it with the Beautiful Soup library. To form a proper dataframe, we first prettify() the table and then use Pandas' read_html function.
My data analysis/visualizations will, later on, rely on maps, so it is necessary to collect geographical and location data. For a proper map with boundaries, I searched for json files which plotted either state or county boundaries as points. However, it was not enough to simply have the boundaries; there also needed to be feature identification or properties, like state abbreviations and FIPS codes. The json files I ended up finding for the map required corresponding state abbreviations and FIPS codes.
Some datasets did not have postal codes or FIPS, so I found a table online with both and parsed it using Beautiful Soup library.
r = requests.get("https://www.nrcs.usda.gov/wps/portal/nrcs/detail/?cid=nrcs143_013696")
state_code = BeautifulSoup(r.text, "html.parser")
state_code = state_code.find(id="detail")
state_code = state_code.prettify() #readable html table format
state_code = pd.read_html(state_code) #returns list of dataframes with only [0] index
state_code = state_code[0]
state_code.columns = state_code.loc[0]
state_code = state_code.drop(0, axis=0) #first row duplicates the header
state_code.head()
Population data was necessary to understand why some counties were covered and not others: most often, the densely populated/largest counties were the ones with identified FIPS codes.
county_population = pd.read_csv("county_population.csv", sep=',')
county_population=county_population[['STATE','COUNTY','STNAME','CTYNAME','POPESTIMATE2017']]
IDFIP = []
PERCENTAGE = []
for index, row in county_population.iterrows():
    if row['COUNTY'] == 0:
        # COUNTY 0 is the state-level row; save its population as the state total
        total_pop = row['POPESTIMATE2017']
        PERCENTAGE.append(1)
    else:
        PERCENTAGE.append(row['POPESTIMATE2017'] / total_pop)
    # pad the state FIPS to 2 digits and the county FIPS to 3 digits
    fip = str(row['STATE']).zfill(2) + str(row['COUNTY']).zfill(3)
    IDFIP.append(fip)
county_population['IDFIP'] = IDFIP
county_population['PERCENTAGE'] = PERCENTAGE
county_population.head()
To analyze income, we need data about American people and households. A staple dataset for this is the American Community Survey. A brief overview: this survey from the U.S. Census Bureau collects data from American households on a monthly basis. Many variables are recorded, including race, gender, age, income, household information, etc.
There are various ways to collect ACS data, ranging in difficulty and flexibility. In general, from least to most difficult/flexible, the options are: Census Reporter, American FactFinder, and IPUMS. In this tutorial, I will be using IPUMS. This tool allows us to custom-select variables and samples before downloading the file. The file can be formatted multiple ways, but I chose csv, or "comma separated values".
The dataset has a massive number of rows, yet still only covers 1% of the population. In general, for large groups like the entire US or individual states, this is fine for drawing conclusions, since the margin of error is fairly insignificant at those levels. To prevent my computer from crashing and burning, I took a subset of this dataset and focused on a state/regional level.
Throughout the data science life cycle there is a lot of selecting. Pandas' .loc and .iloc do a good job of selecting elements and rows. Below I use loc with a boolean statement to select the rows that matter; in this case, people/households that reside within a certain state. This site has some further examples to look over.
us=pd.read_csv("usa_00006.csv", sep=',')
maryland=us.loc[us['STATEFIP'] == 24]
dc=us.loc[us['STATEFIP'] == 11]
virginia=us.loc[us['STATEFIP'] == 51]
penn=us.loc[us['STATEFIP'] == 42]
delaware=us.loc[us['STATEFIP'] == 10]
jersey=us.loc[us['STATEFIP'] == 34]
data = pd.concat([maryland, dc, virginia, penn, delaware, jersey])
data.head()
Before processing the columns and rows to best suit the analysis, we must understand the variables. The description is pretty accessible on the IPUMS USA website. Below are the variables I selected, and a brief description of a few:
list(data)
In this phase of the data life cycle, we tidy the data so that it is readable and easier to use for analysis. This could mean taking subsets to make calculations faster, or dropping unnecessary columns and rows with default unknown values. The ACS accounts for both households and people. Since multiple people exist within a household, there was some duplicate information, so I keep only one of the rows and drop the rest. This can be easily done with pandas' drop_duplicates() function, which by default keeps the first of the duplicate rows. There were also a few FTOTINC, or household income, values which were unknown/n/a, so I dropped those rows as well.
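As a quick sketch of the deduplication idea on toy data (the rows below are hypothetical, not from the ACS file): two people share household 1, so drop_duplicates on the household serial keeps only the first row for that household.

```python
import pandas as pd

# Two people in household 1, one person in household 2 (made-up rows)
df = pd.DataFrame({'SERIAL': [1, 1, 2], 'FTOTINC': [80000, 80000, 45000]})

# keep='first' (the default) retains one row per SERIAL
households = df.drop_duplicates(subset='SERIAL', keep='first')
print(len(households))  # 2 — one row per household
```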
Some variables were marked into numerical categories, but were more understandable through words, so I replaced those values. The .replace() function in dataframes can take different datatypes without problem.
data['SEX'] = data['SEX'].replace({1: 'Male', 2: 'Female'})
data['RACE'] = data['RACE'].replace({1: 'White', 2: 'Black', 3: 'Native',
                                     4: 'Chinese', 5: 'Japanese',
                                     6: 'Asian/Pacific Islander', 7: 'Other',
                                     8: 'Mixed', 9: 'Mixed'})
#INCTOT codes 9999999/9999998 mean n/a or missing; drop those rows
#(the codes are integers in the csv, so compare against ints, not strings)
avoid = data['INCTOT'].isin([9999999, 9999998])
data = data[~avoid]
There are multiple ways to define income and income inequality. One way is to take the total household income and divide it by the square root of the number of people in the household. The square root accounts for shared resources and equivalent consumer expenditures.
def adjust_income(d):
    adjusted_hhincome = []
    #FTOTINC codes 9999999/9999998 mean n/a or missing
    d = d[d.FTOTINC != 9999999]
    d = d[d.FTOTINC != 9999998]
    #keep one row per household; sort by SERIAL so the rows line up with
    #the groupby below, which iterates over SERIAL in sorted order
    household = d.drop_duplicates(subset='SERIAL', keep='first').sort_values('SERIAL')
    for serial, df_household in d.groupby('SERIAL'):
        #household income divided by the square root of household size
        adjusted_hhincome.append(df_household['FTOTINC'].iloc[0] / np.sqrt(len(df_household)))
    household['Adjusted_Income'] = adjusted_hhincome
    household = household.drop(columns=['PERWT','PERNUM','YEAR','DATANUM','SERIAL','CBSERIAL'])
    return household
This dataframe includes all sampled households in the East Coast states: Maryland, Virginia, DC, Pennsylvania, Delaware, and New Jersey. The income is now adjusted to account for shared household expenditures.
east_coast=adjust_income(data)
east_coast.head()
Now, during the data analysis phase, we determine the proper statistics and formulas to use to observe trends. Oftentimes, sklearn is useful when you have an n-array/matrix of unweighted values. With a smaller dataset, weights can simply be represented through duplicate rows; however, with this large dataset, it is a lot more costly to do so. Instead of trying to set up proper parameters for sklearn, I used the numpy library with simple, straightforward statistics. Numpy allows for quick mathematical operations on n-arrays, like here with d['HHWT'] (the weight/count of each household's representation). Sklearn is better suited for more complex statistics and machine learning.
Next I calculated income statistics for a given demographic. The ACS dataset has weighted households, meaning some households represent more of the population than others, so I had to take that into account while finding medians/percentile cut-offs.
Taking the median income, rather than the average income, can give a more accurate representation of the general population. The median is robust, unlike the average, so the statistic will not be heavily affected by outliers like extremely wealthy people.
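A tiny made-up sample shows the difference: adding one extremely wealthy person barely moves the median but inflates the mean enormously.

```python
import numpy as np

incomes = np.array([40000, 45000, 50000, 55000, 60000])
with_outlier = np.append(incomes, 10_000_000)  # one extremely wealthy person

print(np.median(incomes), np.median(with_outlier))  # 50000.0 vs 52500.0
print(np.mean(incomes), np.mean(with_outlier))      # 50000.0 vs over 1.7 million
```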
There are many ways of calculating income inequality. A straightforward approach is the decile (90/10) ratio, where you take the income at the 90th-percentile cutoff and divide it by the income at the 10th-percentile cutoff. When processing weighted statistics, numpy functions make the process easier by, for instance, returning an array's cumulative sum.
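The cumulative-weight idea can be checked on a toy sample first (the incomes and weights below are made up): sort incomes, accumulate the weights, and take the first income whose cumulative weight passes the desired fraction of the total.

```python
import numpy as np

# incomes sorted ascending, each with a household weight (total weight = 10)
income = np.array([10_000, 20_000, 50_000, 90_000])
weight = np.array([1, 4, 4, 1])

cumsum = np.cumsum(weight)  # [1, 5, 9, 10]
# first income at or past 10% / 90% of the total weight
tenth = income[cumsum >= 0.1 * weight.sum()][0]      # 10_000
ninetieth = income[cumsum >= 0.9 * weight.sum()][0]  # 50_000

print(ninetieth / tenth)  # 90/10 ratio: 5.0
```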
def income_stat(d):
    #finding adjusted income inequality with weighted percentiles
    d.sort_values('Adjusted_Income', inplace=True)
    cumsum = np.cumsum(d['HHWT'])
    cutoff = np.sum(d['HHWT']) / 2.0
    ninety = np.sum(d['HHWT']) * 0.9
    ten = np.sum(d['HHWT']) * 0.1
    median = d.Adjusted_Income[cumsum >= cutoff].iloc[0]
    num = d.Adjusted_Income[cumsum >= ninety].iloc[0]
    denom = d.Adjusted_Income[cumsum >= ten].iloc[0]
    if denom == 0:
        income_inequality = np.nan
    else:
        income_inequality = num / denom
    return [d['STATEFIP'].iloc[0], d['COUNTYFIPS'].iloc[0], median, num, denom, income_inequality]
print(income_stat(east_coast))
income_stats = pd.DataFrame(columns=['State','County','Median','Ninetieth','Tenth','90/10 Ratio'])
for state in [maryland, dc, virginia, penn, delaware, jersey]:
    state = adjust_income(state)
    income_stats = income_stats.append(pd.DataFrame([income_stat(state)],
                        columns=['State','County','Median','Ninetieth','Tenth','90/10 Ratio']))
income_stats
def counties_stat(d):
    county_fips = d['COUNTYFIPS'].unique()
    county_stats = pd.DataFrame(columns=['State','County','Median','Ninetieth','Tenth','90/10 Ratio'])
    for county_fip in county_fips:
        x = d.loc[d['COUNTYFIPS'] == county_fip]
        x = adjust_income(x)
        county_stats = county_stats.append(pd.DataFrame([income_stat(x)],
                            columns=['State','County','Median','Ninetieth','Tenth','90/10 Ratio']))
    return county_stats
counties_map_data = pd.concat([counties_stat(s) for s in
                               [maryland, virginia, dc, penn, delaware, jersey]])
counties_map_data.head()
Data visualizations (maps!) help people understand where income inequality is concentrated, which was one of my main objectives. In general, maps are a great visual for providing geographical insight. Folium is a great tool for creating interactive maps, built on leaflet.js. The documentation gives a thorough overview of how to use it. Here I used choropleth maps, but folium also has other options like markers and heatmaps.
I needed the income_stats dataframe to have corresponding postal codes so that I could map the statistics to the json file's feature ids. Here's a link to learn more about mapping with json. Basically, each polygon is identified, in this case, by its postal code, which we later use as a key to plot data. I used the dataframe's .loc[], which selects columns or rows based on the boolean statements within it.
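For reference, here is a stripped-down sketch of the GeoJSON structure the key_on paths used later point into (the id, GEOID, and coordinates are placeholders, not taken from the real boundary files):

```python
# A minimal GeoJSON feature, as a Python dict
feature = {
    "type": "Feature",
    "id": "MD",  # matched by key_on='feature.id' (state map)
    "properties": {"GEOID": "24031", "name": "Maryland"},  # key_on='feature.properties.GEOID' (county map)
    "geometry": {"type": "Polygon",
                 "coordinates": [[[0, 0], [0, 1], [1, 1], [0, 0]]]},
}

# folium matches the first column passed in `columns` against these keys
print(feature["id"], feature["properties"]["GEOID"])
```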
Postal_Code = []
for index, row in income_stats.iterrows():
    x = str(row['State'])
    match = state_code.loc[state_code['FIPS'] == x]['Postal Code'].values
    if len(match) == 0:
        Postal_Code.append(' ')
    else:
        Postal_Code.append(match[0])
income_stats["Postal Code"] = Postal_Code
income_stats.head()
Using folium choropleth maps, I displayed income and income inequality based on location, both state and county.
state_geo = 'us-states.json'
m = folium.Map(location=[40, -75], zoom_start=6)
style = {'fillColor': '#00000000', 'color': '#00000000'}
folium.Choropleth(
    geo_data=state_geo,
    name='choropleth',
    data=income_stats,
    columns=['Postal Code', 'Median'],
    key_on='feature.id',
    fill_color='YlGn',
    fill_opacity=0.7,
    line_opacity=0.9,
    legend_name='Median Income',
    highlight=True
).add_to(m)
highlight = {'fillColor': '#000000',
             'color': 'green',
             'weight': 3}
folium.GeoJson(state_geo, name='tooltip',
               tooltip=folium.features.GeoJsonTooltip(fields=['name'], aliases=['State:']),
               highlight_function=lambda a: highlight,
               style_function=lambda x: style).add_to(m)
folium.LayerControl().add_to(m)
m
Maryland has the highest median income of the East Coast states I chose to focus on.
IDFIP = []
for index, row in counties_map_data.iterrows():
    # pad the county FIPS to 3 digits and prepend the state FIPS
    IDFIP.append(str(row['State']) + str(row['County']).zfill(3))
counties_map_data["IDFIP"] = IDFIP
counties_map_data.head()
state_geo = 'cb_2017_us_county_20m.json'
m = folium.Map(location=[40, -75], zoom_start=6)
folium.Choropleth(
    geo_data=state_geo,
    name='average income',
    data=counties_map_data,
    columns=['IDFIP', 'Median'],
    key_on='feature.properties.GEOID',
    fill_color='YlGn',
    fill_opacity=0.9,
    line_opacity=1,
    legend_name='Median Income',
    highlight=True
).add_to(m)
highlight = {'fillColor': '#ffaf00',
             'color': 'green',
             'weight': 3}
folium.GeoJson(state_geo, name='tooltip',
               tooltip=folium.features.GeoJsonTooltip(fields=['NAME'], aliases=['County:']),
               highlight_function=lambda a: highlight,
               style_function=lambda x: style).add_to(m)
folium.LayerControl().add_to(m)
m
The darker the color, the wealthier the county. In this map we can tell that areas like Montgomery, Howard, and Hunterdon County are doing well. However, areas like Baltimore and Philadelphia have lower median incomes, and western Pennsylvania shows lower incomes as well.
county_geo = 'cb_2017_us_county_20m.json'
m = folium.Map(location=[40, -75], zoom_start=6)
folium.Choropleth(
    geo_data=county_geo,
    name='average income',
    data=counties_map_data,
    columns=['IDFIP', '90/10 Ratio'],
    key_on='feature.properties.GEOID',
    fill_color='YlGn',
    fill_opacity=0.9,
    line_opacity=1,
    legend_name='Income Inequality',
    highlight=True
).add_to(m)
highlight = {'fillColor': '#ffaf00',
             'color': 'green',
             'weight': 3}
folium.GeoJson(county_geo, name='tooltip',
               tooltip=folium.features.GeoJsonTooltip(fields=['NAME'], aliases=['County:']),
               highlight_function=lambda a: highlight,
               style_function=lambda x: style).add_to(m)
folium.LayerControl().add_to(m)
m
Here we can see that areas like Baltimore, Essex, and Centre County have more income inequality than the rest. From this analysis, a policy suggestion could be to focus on these counties and make changes within their public services.
county_geo = 'cb_2017_us_county_20m.json'
m = folium.Map(location=[40, -75], zoom_start=6)
highlight = {'fillColor': '#ffaf00',
             'color': 'green',
             'weight': 3}
folium.GeoJson(county_geo, name='tooltip',
               tooltip=folium.features.GeoJsonTooltip(fields=['NAME'], aliases=['County:']),
               highlight_function=lambda a: highlight,
               style_function=lambda x: style).add_to(m)
folium.Choropleth(
    geo_data=county_geo,
    name='population',
    data=county_population,
    columns=['IDFIP', 'PERCENTAGE'],
    key_on='feature.properties.GEOID',
    fill_color='YlGn',
    fill_opacity=1,
    line_opacity=0.9,
    legend_name='Population',
    highlight=True
).add_to(m)
folium.LayerControl().add_to(m)
m
I acknowledge that these plots do not take the sample weights into consideration, but I wanted to show the seaborn library visualizations anyway. Seaborn is a useful library for plotting both categorical and quantitative data. Here I am using categorical plots with pandas dataframes. For more on pandas with seaborn, check this out.
sns.catplot(x="RACE", y="FTOTINC", kind="box", data=data, aspect=22/12)
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.barplot(x="RACE", y="INCTOT", hue="SEX", data=data, ci=None)
Box plots and violin plots are good at showing a variable's (income's) distribution for different groups. The box plot above compares race to income. It shows that among the racial groups, Asians such as Chinese and Japanese have the widest spread in income between the 25th and 75th percentiles. Then, with the bar graph, I looked at individual income to see differences between male and female pay: for every racial group, men earn more than women. These are generalizations from the graphs and cannot be fully relied upon, since they do not consider the household/person weights of the ACS sample.